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Hazard of Overfitting 
Roadmap 
@ When Can Machines Learn? 
@ Why Can Machines Learn? 
© How Can Machines Learn? 


Lecture 12: Nonlinear Transform 













nonlinear L] via nonlinear feature transform ® 
plus linear _] with price of model complexity 


© How Can Machines Learn Better? 


Lecture 13: Hazard of Overfitting 


ə What is Overfitting? 
e The Role of Noise and Data Size 
ə Deterministic Noise 

ə Dealing with Overfitting 
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Hazard of Overfitting What is Overfitting? 


Bad Generalization 


regression for x € R with N = 5 
examples 


target f(x) = 2nd order polynomial 
label yn = f(Xn) + very small noise 


linear regression in Z-space + 

® = 4th order polynomial 

unique solution passing all examples 
= En(g)=0 

Eout(g) huge 











bad generalization: low Ein, high Eout 
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Hazard of Overfitting What is Overfitting? 


Bad Generalization and Overtitting 


e take dvc = 1126 for learning: out-of-sample error 
bad generalization 
—(Eout a Ein) large 







model complexity 








z l 
g l 
e switch from dvo = dž, to dvo = 1126: |” i 
overfitting l 
—Ein I. Zour T in-sample error 
e switch from Ac = ae to Ac =F do VC dimension, dyc 
underfitting 


— Cin T Eout i 





bad generalization: low Ein, high Eout; 
overfitting: lower En, higher Eout 
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Hazard of Overfitting What is Overfitting? Selle mon 
Cause of Overfitting: A Driving Analogy 
O Data 


— Target 
=t 


















‘good fit’ => overfit 
learning | driving 
overtit commit a car accident 
use excessive Ayo ‘drive too fast’ 
noise bumpy road 


limited data size N | limited observations about road condition 





next: how does noise & data size affect 
overfitting? | 
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Hazard of Overfitting What is Overfitting? 


Fun Time 


Based on our discussion, for data of fixed size, which of the following 
situation is relatively of the lowest risk of overfitting? 


@ small noise, fitting from small ay. to median dyco 
@ small noise, fitting from small dyc to large Ac 
© large noise, fitting from small ayc to median ayc 
© large noise, fitting from small ayc to large dvc 
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Hazard of Overfitting What is Overfitting? 


Fun Time 


Based on our discussion, for data of fixed size, which of the following 
situation is relatively of the lowest risk of overfitting? 


@ small noise, fitting from small ay. to median dyco 
@ small noise, fitting from small dyc to large Ac 
© large noise, fitting from small ayc to median ayc 
© large noise, fitting from small ayc to large dvc 


Reference Answer: D) 


Two causes of overfitting are noise and 
excessive dyc. So if both are relatively ‘under 
control’, the risk of overfitting is smaller. 
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Hazard of Overfitting The Role of Noise and Data Size 


Case Study (1/2) 


10-th order target function 50-th order target function 


O Data O Data 
— Target — Target 





overfitting from best go € Hə to best gin € H10? j 
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Hazard of Overfitting The Role of Noise and Data Size 


Case Study (2/2) 


10-th order target function 50-th order target function 


















O Data 
— 2nd Order Fit 
—10th Order Fit 


O Data 
—2nd Order Fit 
—10th Order Fit 


x 


| 92 € He G10 € Hio 
Ei 0.050 0.034 
E 0.127 9.00 


| 92 EH2 G10 € Hio 
Ei 0.029 0.00001 
E 0.120 7680 


overfitting from g> to 919? both yes! 


Hsuan-Tien Lin (NTU CSIE) Machine Learning Foundations 7/22 














Hazard of Overfitting The Role of Noise and Data Size 


lrony of Two Learners 










O Data 
—2nd Order Fit 
—10th Order Fit 


z x 
e learner Overfit: pick gio € H10 
e learner Restrict: pick g2 € H2 


e when both know that target = 10th 
—R ‘gives up’ ability to fit 


but R wins in Eou a lot! 
philosophy: concession for advantage? :-) | 
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Hazard of Overfitting The Role of Noise and Data Size 


Learning Curves Revisited 
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Number of Data Points, N Number of Data Points, N 








e H40: lower Eout when N > oo, 
but much larger generalization error for small N 


° [Gray area): O overfits! (ea an) 
R always wins in Eut if N small! | 
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Hazard of Overfitting The Role of Noise and Data Size 


The ‘No Noise’ Case 






> > 
O Data 
—2nd Order Fit 
— 10th Order Fit 2y 
£ x 


e learner Overfit: pick gio € H10 
e learner Restrict: pick go € He 


e when both know that there is no 
noise —A still wins 





is there really no noise? 
‘target complexity’ acts like noise 
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Hazard of Overfitting The Role of Noise and Data Size 


Fun Time 


When having limited data, in which of the following case would 
learner R perform better than learner O? 


@ limited data from a 10-th order target function with some noise 
@ limited data from a 1126-th order target function with no noise 
© limited data from a 1126-th order target function with some noise 
@ all of the above 
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Hazard of Overfitting The Role of Noise and Data Size 


Fun Time 


When having limited data, in which of the following case would 
learner R perform better than learner O? 
@ limited data from a 10-th order target function with some noise 
@ limited data from a 1126-th order target function with no noise 
© limited data from a 1126-th order target function with some noise 


@ all of the above 








Reference Answer: © 


We discussed about (4) and (2), but you shall | 
be able to ‘generalize’ :-) that R also wins in 


the more difficult case of (3). 
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Hazard of Overfitting Deterministic Noise 


A Detailed Experiment 







y = ere 
Q; 
~ Gaussian X oat) il 
q=0 
— 
f(x) O Data 


— Target 


e Gaussian iid noise « with level o? 


e some ‘uniform’ distribution on f(x) 
with complexity level Q; 


e data size N 





goal: ‘overfit level’ for 
different (N, o?) and (N, Qy)? | 
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Hazard of Overfitting Deterministic Noise 


The Overfit Measure 






O Data 
— 2nd Order Fit 
— 10th Order Fit 





e 92 E€ H2 
e gio E Hio 
e En(g10) < En(92) for sure 


overfit measure Eout(g10) — Eout(92) | 
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: eA 
Noise Level, g4 





The Results 


Number of Data Points, N 
fixed Q; = 


A 


0.1 


= 


aq 
= 


Target Complexity, Qz 





-0.2 


80 100 120 
Number of Data Points, N 
fixed o? = 0.1 





Hazard of Overfitting Deterministic Noise 


Impact of Noise and Data Size 


impact of o? versus N: impact of Q; versus N: 





1 


Noise Level, o? 
iS 





Target Complexity, Q 


2 
0.1 
0 
il 
-0.1 
0! -0.2 


80 100 . 120 80 100 120 
Number of Data Points, N Number of Data Points, N 











data size N | overfit + 
stochastic noise f overfit + 
deterministic noise t overfit t 
excessive power + overfit + 


four reasons of serious overfitting: 














overfitting ‘easily’ happens 
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Hazard of Overfitting Deterministic Noise 


Deterministic Noise 


e if f € H: something of f cannot be h* 
captured by H 

deterministic noise : difference 

between best h* € H and f > 

e acts like ‘stochastic noise’ —not new 

to CS: pseudo-random generator f 

difference to stochastic noise: 


e depends on H. T 
e fixed for a given x 





philosophy: when teaching a kid, 
perhaps better not to use examples 
from a complicated target function? :-) 
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Hazard of Overfitting Deterministic Noise 


Fun Time 


Consider the target function being sin(1126x) for x € [0,27]. When x 
is uniformly sampled from the range, and we use all possible linear 
hypotheses h(x) = w - x to approximate the target function with 
respect to the squared error, what is the level of deterministic noise for 
each x? 


© |sin(1126x)| 

@ |sin(1126x) — x| 

© |sin(1126x) + x| 

© |sin(1126x) — 1126x| 
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Hazard of Overfitting Deterministic Noise 


Fun Time 


Consider the target function being sin(1126x) for x € [0,27]. When x 
is uniformly sampled from the range, and we use all possible linear 
hypotheses h(x) = w- x to approximate the target function with 
respect to the squared error, what is the level of deterministic noise for 
each x? 


© |sin(1126x)| 

@ |sin(1126x) — x| 

© | sin(1126x) + x| 

© |sin(1126x) — 1126x| 


Reference Answer: ‘aD 


You can try a few different w and convince 
yourself that the best hypothesis h* is 
h*(x) = 0. The deterministic noise is the 
difference between f and h*. ] 
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Hazard of Overfitting Dealing with Overfitting 


Driving Analogy Revisited 











learning driving 
overfit commit a car accident 
use excessive dyco ‘drive too fast’ 
noise bumpy road 
limited data size N limited observations about road condition 
start from simple model drive slowly 


data cleaning/pruning 
data hinting 
regularization 
validation 





use more accurate road information 
exploit more road information 
put the brakes 
monitor the dashboard 





all very practical techniques 
to combat overfitting | 
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Hazard of Overfitting Dealing with Overfitting 


Data Cleaning/Pruning 









































e if ‘detect’ the outlier 5 at the top by 


e too close to other o, or too far from other x 
e wrong by current classifier 
Oe 


e possible action 1: correct the label (data cleaning) 
e possible action 2: remove the example (data pruning) 





possibly helps, but effect varies 
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Hazard of Overfitting Dealing with Overfitting 


Data Hinting 
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e slightly shifted/rotated digits carry the same meaning 


e possible action: add virtual examples by shifting/rotating the 
given digits (data hinting) 


possibly helps, but watch out 
—virtual example not S P(x,y)! | 
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Hazard of Overfitting Dealing with Overfitting 


Fun Time 


Assume we know that f(x) is symmetric for some 1D regression 
application. That is, f(x) = f(—x). One possibility of using the 
knowledge is to consider symmetric hypotheses only. On the other 
hand, you can also generate virtual examples from the original data 
{(Xn, Yn)} as hints. What virtual examples suit your needs best? 

@ {(%n, -Yn)} 
@ {(—Xn, —Yn)} 
© {(—Xn, Yn)} 
© {(2xn, 2Yn)} 
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Hazard of Overfitting Dealing with Overfitting 


Fun Time 


Assume we know that f(x) is symmetric for some 1D regression 
application. That is, f(x) = f(—x). One possibility of using the 
knowledge is to consider symmetric hypotheses only. On the other 
hand, you can also generate virtual examples from the original data 
{(Xn, Yn) } as hints. What virtual examples suit your needs best? 

@ {(Xn,—Yn)} 
© {(—Xn, —Yn)} 
© {(—Xn, Yn)} 
© {(2xn, 2Yn)} 





Reference Answer: 6) 


We want the virtual examples to encode the 
invariance when x —> —x. 
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Hazard of Overfitting Dealing with Overfitting 


Summary 
@ When Can Machines Learn? 


© Why Can Machines Learn? 
© How Can Machines Learn? 


Lecture 12: Nonlinear Transform 


© How Can Machines Learn Better? 


Lecture 13: Hazard of Overfitting 


ə What is Overfitting? 
lower En but higher Eout 
ə The Role of Noise and Data Size 
overfitting ‘easily’ happens! 
ə Deterministic Noise 
what H cannot capture acts like noise 
ə Dealing with Overfitting 
data cleaning/pruning/hinting, and more 





e next: putting the brakes with regularization 
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